Extension of Analytic Hierarchy Model for High-Efficiency Clustering in Software Defect Prediction

Author

  • Wenshuai Wu
Abstract

Many validity measures and learning algorithms have been applied to quantitatively evaluate the performance of clustering algorithms. However, existing validity measures and clustering evaluations are based on a single realization and lack generality. Because these tasks involve more than one criterion, they can be modelled as a multiple criteria decision making (MCDM) problem. In this paper, we extend our previously proposed analytic hierarchy model (AHM) to select the best clustering algorithm for high-efficiency clustering in software defect prediction. This extension of AHM is examined and verified by an experimental study using six clustering algorithms, eight validity measures, four MCDM methods and three software defect data sets from the NASA Metrics Data Program repository. The results of a comparative study and visual analysis demonstrate that our extension of AHM is an efficient tool for selecting the best clustering algorithm for a given data set, and that it provides a universal framework for rapidly improving clustering efficiency.

1. Background and Motivation

Clustering analysis, the most widely adopted unsupervised pattern technique, partitions the original data space into groups that have high intra-group similarity and high inter-group dissimilarity, without a priori information. The purpose of clustering is therefore to identify natural structures in a data set (Mirkin, 2005), and clustering analysis plays an important role in many fields such as medicine, psychology, sociology, image processing and pattern recognition (Pensa and Boulicaut, 2008; X. Wu and Kumar, 2009). In general, some fundamental questions need to be addressed in any typical clustering scenario, such as: (1) What is a good clustering technique for a given data set? (2) How many clusters are actually present in the data? This paper focuses on the first question.

Clustering analysis is an important research problem in data mining.
There are several extensive overviews of clustering algorithms in the literature (Gunnemann et al., 2010; Davidson and Ravi, 2005). However, Naldi et al. (2013) pointed out that different clustering algorithms may produce different partitions, so selecting a good clustering technique or algorithm for a given data set is a challenging problem because of the application-dependent nature of clustering. In this article, our previously proposed AHM (Kou and Wu, 2014a) is extended to evaluate clustering algorithms in order to select a robust clustering technique or algorithm for a given data set.

According to the Research Triangle Institute (RTI), software defects cost the U.S. economy billions of U.S. dollars annually, and more than a third of the costs associated with software defects could be avoided by improving software testing. Software defects have become the de facto industry standard. Software defect prediction can be modelled as a data mining problem by categorizing software units as either fault-prone (fp) or non-fault-prone (nfp) using historical data (Peng et al., 2011). In our empirical study, therefore, three software defect data sets from the National Aeronautics and Space Administration (NASA) Metrics Data Program repository (Chapman et al., 2004), cleaned by Shepperd et al. (2011), are selected for clustering validation.

Many validity measures have been proposed for clustering evaluation. Xie and Beni (1991) proposed a fuzzy validity criterion based on a validity function. Chou et al. (2004) proposed a CS cluster validity measure that can deal with clusters of different densities or sizes. Zhang et al. (2008) proposed a cluster validity measure based on variation and separation for validating partitions of object data produced by the fuzzy c-means algorithm.
Zalik (2010) proposed a CO cluster validity measure based on compactness and overlapping measures to estimate the quality of partitions. Saha and Bandyopadhyay (2012) proposed a connectivity measure incorporated into seven cluster validity indices, which can automatically detect clusters of any shape, size or convexity as long as they are well separated.

With so many validity measures proposed for clustering evaluation, which one is the best? The proposed validity measures can be classified into three categories: external, internal, and relative measures (2007). However, existing validity measures and clustering evaluations are based on a single realization and lack generalizability. To address these problems, this paper extends our previously proposed AHM (Kou and Wu, 2014a) to clustering evaluation.

Since the evaluation of clustering algorithms involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. The extended AHM consists of three stages. First, in the DM stage, the six most influential clustering algorithms are used for task modelling on three software defect data sets. Two of the three are selected randomly as a training set, and the remaining one is taken as a test set. Second, in the MCDM stage, four classical MCDM methods are applied to measure the initial performance of the clustering algorithms, with eight validity measures as input. Finally, in the secondary mining stage, multi-attribute utility theory (MAUT) with consensus decision-making is proposed to determine the best performance and select the most robust clustering algorithm across the three data sets. The three data sets are from the National Aeronautics and Space Administration (NASA) Metrics Data Program repository. Burn et al. (2007) showed that some external measures for evaluating clustering results perform better than internal and relative measures. This paper therefore selects eight external validity measures for clustering algorithm assessment.

The contributions of this paper are threefold. First, the principal contribution is to extend our previously proposed AHM (Kou and Wu, 2014a) to evaluate the performance of clustering algorithms and select the best clustering technique for a given data set, which is one of the two hot issues in clustering validity. Second, this paper carries out secondary knowledge discovery with MAUT to improve the degree of agreement on clustering algorithm efficiency for decision optimization across multiple data sets. Third, the software defect problem is studied and resolved using clustering analysis, a new and important research perspective that differs from traditional classification-based prediction.

The remainder of this paper is organized as follows. Section 2 describes our technical solutions, the clustering algorithms, the selected MCDM methods, the validity measures, and MAUT. Section 3 presents details of the experimental study and analyzes the results. Section 4 discusses the significance and impact of our work. Section 5 summarizes the paper.

2. Technical Solutions and Methods

2.1. Technical Solutions

In this section, our previously proposed AHM (Kou and Wu, 2014a) is extended to evaluate and select the best clustering algorithms. As an essential step in cluster analysis, cluster validation is necessary to ensure that the clustering structures do not occur by chance (Jain et al., 1999). As one of the two hot issues in cluster validation, selecting the best clustering technique for a given data set is a challenging problem. Many validity measures have been proposed in an attempt to identify the most effective clustering.
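To make the idea of an external validity measure concrete, the following minimal Python sketch compares cluster assignments against known fp/nfp classes. This excerpt does not name the eight measures used in the study, so the Rand index and purity shown here are only familiar members of the external family and are an assumption, not the paper's selection. The labels and the function names `rand_index` and `purity` are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def rand_index(truth, pred):
    """Fraction of object pairs on which the two labelings agree.

    A pair agrees when it is grouped together in both labelings
    or separated in both labelings.
    """
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs
    )
    return agree / len(pairs)


def purity(truth, pred):
    """Share of objects that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(truth, pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(truth)


# Hypothetical data: six software modules with known classes and
# the cluster ids that some clustering algorithm assigned to them.
truth = ["fp", "fp", "nfp", "nfp", "nfp", "nfp"]
pred = [0, 0, 0, 1, 1, 1]

ri = rand_index(truth, pred)
pu = purity(truth, pred)
print(f"Rand index = {ri:.3f}, purity = {pu:.3f}")
```

Both measures need the true class labels, which is what distinguishes external measures from internal and relative ones.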
Because the evaluation of clustering algorithms normally involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. However, there are situations in which different MCDM methods produce conflicting rankings, and the results that learning algorithms provide may show a large knowledge gap (Domingos, 2007) relative to expectations, experience, and expertise, which makes it hard to make decisions accurately and effectively (Kou and Wu, 2014a). In addition, existing validity measures and clustering evaluations are based on a single realization and lack generality. To deal with these problems, this paper carries out secondary knowledge discovery with MAUT in an attempt to find the best clustering algorithms for high-efficiency clustering.

The extension of AHM for clustering evaluation consists of three stages: the DM stage, the MCDM stage and the Secondary Mining (SM) stage. The framework is presented in Fig 1.

In the first stage, the DM stage, six of the most influential clustering algorithms are selected for task modelling on three software defect data sets. Two of the three are selected randomly as a training set, and the remaining one is taken as a test set. The selected clustering algorithms are Expectation Maximization (EM), Farthest First (FF), Filtered Clusterer (FC), Hierarchical Clusterer (HC), Make Density Based Clusterer (MDBC), and Simple K-Means (KM) (Han and Kamber, 2006).

International Journal of Management Science 2015; 2(2): 13-20

In the second stage, the MCDM stage, four classic and important MCDM methods (i.e., SWM, TOPSIS, PROMETHEE II and Gray Relational Analysis) are applied to provide an initial ranking that measures the performance of the clustering algorithms. In this stage, the MCDM methods treat the clustering algorithms as alternatives and the outputs of each clustering algorithm on the validity measures as criteria. All of these MCDM methods are implemented in MATLAB 7.0.
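As a rough illustration of how an MCDM method turns validity-measure outputs into a ranking of algorithms, the sketch below follows the standard TOPSIS steps (vector normalization, weighting, distance to the ideal and anti-ideal points, closeness coefficient). It is written in Python rather than the MATLAB 7.0 used in the paper, and it is not the paper's implementation. The decision matrix, the weights, the mix of benefit and cost criteria, and the function name `topsis` are all hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) against criteria (columns) with TOPSIS.

    matrix  : one row per alternative, one column per criterion
    weights : one weight per criterion
    benefit : True where larger is better, False where smaller is better
    Returns one closeness coefficient per alternative; higher is better.
    Assumes no all-zero column and at least two distinct alternatives.
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal value for every criterion.
    cols = list(zip(*v))
    ideal = [max(c) if b else min(c) for c, b in zip(cols, benefit)]
    anti = [min(c) if b else max(c) for c, b in zip(cols, benefit)]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical decision matrix: three algorithms, three criteria.
# The first two criteria are benefit measures, the third is a cost measure.
algorithms = ["EM", "FF", "KM"]
decision_matrix = [
    [0.75, 0.70, 0.25],
    [0.60, 0.65, 0.35],
    [0.80, 0.80, 0.20],
]
weights = [0.4, 0.4, 0.2]
benefit = [True, True, False]

closeness = topsis(decision_matrix, weights, benefit)
ranking = sorted(zip(algorithms, closeness), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Running several such methods on the same matrix can yield different orderings, which is the conflict the third stage is meant to reconcile.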
In the third stage, the SM stage, result revelation with MAUT is applied to measure the best performance of the top-ranked clustering algorithms in order to determine a final ranking. MAUT is a structured methodology designed to handle trade-offs among multiple objectives. Here MAUT is applied to clustering algorithm evaluation and selection to identify the best clustering algorithms for a given data set, which can effectively improve the degree of agreement on clustering algorithm efficiency for decision optimization.

Fig 1. The extension of analytic hierarchy model for high-efficiency clustering

2.2. Clustering Algorithm

Clustering is a popular unsupervised pattern technique that groups data into classes or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. The goal of clustering algorithms is to partition a data set by looking for a finite collection of clusters according to the similarities between its objects (Naldi et al., 2013). In general, the major clustering methods can be classified into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods. Several clustering algorithms are reported in the literature, such as the k-means algorithm, the k-medoids algorithm, the divisive approach, DBSCAN, STING, EM, SOM, frequent pattern-based clustering, and constraint-based clustering (Han and Kamber, 2006).

In the DM stage of the extension of AHM, six clustering algorithms are chosen for the empirical study: the Expectation Maximization algorithm, Farthest First, Filtered Clusterer, Hierarchical Clusterer, Make Density Based Clusterer and the Simple K-Means algorithm. All are implemented using WEKA (Waikato Environment for Knowledge Analysis), a free machine learning software package (Hall and Frank, 2009).
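For readers less familiar with the partitioning family, the sketch below shows the two alternating steps of the basic k-means iteration (Lloyd's algorithm) in plain Python. It is an illustrative assumption, not WEKA's Simple K-Means implementation used in the study. The points, the fixed initial centroids (chosen to keep the run deterministic), and the function name `kmeans` are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The resulting labels are the kind of output that validity measures score and that the MCDM stage then treats as criteria values.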


Similar resources

Analysis of the Utility of Economic Sectors in Achieving Agricultural Development: Applying an Analytic Hierarchy Process

ABSTRACT- According to article 44 of the I.R. of Iran Constitution, the Iranian economy consists of three sectors; the state, the cooperative, and the private sectors. The aim of this study was to present a hierarchy of criteria for selecting the best economic sector for agricultural development. Analytic Hierarchy Process (AHP) was used to calculate the relative importance of either criteria o...


Hierarchy Style Application in Line Extension with Responsive Loads Evaluating the Dynamic Nature of Solar Units

This paper presents a model for line extension scheduled to participate in responsive loads in the power system aiming the improvement of techno-economical parameters. The model is studied with the presence of photovoltaic generators that produce variable power depending on the geographical condition. The investment cost of the transmission expansion plan, demand response operation cost, genera...


Improving the Ecological Sustainability by Applying the Appropriate Cultivars of Rice: Using AHP

This study is a survey research. The population of the study was all specialists who have enough data about ways of sustainability rice cultivars that they were identified and studied through Non-probability sampling (purposeful and snowball). To determine the validity of the questionnaire, face and content validity was used, and, to assess the reliability, inconsistency coefficient was used. I...


A two-stage model for ranking DMUs using DEA/‎AHP‎

‎In this paper, we present a two-stage model for ranking of decision making units (DMUs) using interval analytic hierarchy process (AHP). Since the efficiency score of unity is assigned to the efficient units, we evaluate the efficiency of each DMU by basic DEA models and calculate the weights of the criteria using proposed model. In the first stage, the proposed model evaluates decision making...


Zoning of Avalanche-Prone Passes in Kurdistan Province

Risk is an inevitable part of life, every day people are somehow at risk. Different risks in various forms and perspectives have different functions. Kurdistan province, with various heights and relatively good rainfall, It results the country's cold spots. Since most of seasonal rainfall occurs in winter, Snow cover is often the domain and passes it hillsides. One of the concerns of people in ...



Journal: International Journal of Management Science

Volume 2, Issue 2

Pages 13-20

Publication date: 2015